The objective of this analysis is to reduce the number and duration of absences of the company's employees. To achieve that, we will work with a dataset created by the HR department containing all the recent absences together with their duration and a set of features that they consider important. They have collected 593 observations over a period of xx time.
The steps followed are:
- Identify the most important features and their relation to, and impact on, long absences. During this stage we cleaned the dataset, made some plots to better understand the data and gain insights, and finally performed some feature engineering and feature selection.
- Provide a Machine Learning model to identify beforehand absences that might be prolonged for more than 5 hours. This is a Classification Problem and we will optimize our model for Accuracy. Considering the application of the model, we decided this is a better evaluation metric than Precision or Recall.
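To make the metric choice concrete, here is a minimal sketch (with a hypothetical confusion matrix, not results from this dataset) of how Accuracy differs from Precision and Recall:

```r
# Hypothetical counts TP/FP/FN/TN of a classifier (illustration only)
TP <- 40; FP <- 10; FN <- 20; TN <- 130

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # all correct predictions
precision <- TP / (TP + FP)                   # predicted positives that are real
recall    <- TP / (TP + FN)                   # real positives that we catch

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy = 0.850, precision = 0.800, recall = 0.667
```

Accuracy rewards overall correctness on both classes, which is what matters for the intended use of the model here.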
This is an iterative process: every time we do some feature engineering we run the models again, and so on until we find the most accurate model. In the script you will see different transformations applied to the dataset (bucketizing, grouping observations with low frequency, creating new features, etc.). Some of them improved the model and others have not; nevertheless, we kept them all with a detailed explanation of our observations and conclusions. The actions performed might be triggered by insights from the data visualization section or by general knowledge of this field.
During the analysis we will refer to the observations with a prolonged absence (> 5 hours) as positives or ones, while those with short absences are called negatives or zeros.
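As an illustration of this labelling (the provided CSVs already ship with the binary Absenteeism target; the hours vector below is hypothetical):

```r
# Hypothetical absence durations in hours
hours <- c(2, 8, 0, 5, 40)
# 1 (positive) = prolonged absence of more than 5 hours, 0 (negative) = short absence
label <- ifelse(hours > 5, 1, 0)
label
# [1] 0 1 0 0 1
```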
rm(list = ls()) # Erase all variables from memory
# Libraries used throughout the script
library(dplyr)     # %>%, recode(), distinct(), select_at()
library(tidyr)     # gather()
library(purrr)     # keep()
library(reshape2)  # melt()
library(ggplot2)   # plots
library(corrplot)  # correlation plot
library(caret)     # train(), trainControl()
library(doMC)      # registerDoMC() parallel backend
library(questionr) # freq()
#path<- "/Users/estevezmelgarejoramon/Desktop/ML2/FirstAssignment/Absenteeism"
#setwd(path)
training <- read.csv(file = file.path("Absenteeism_at_work_classification_training.csv"), header = TRUE, dec = ".", sep = ";")
test <- read.csv(file = file.path("Absenteeism_at_work_classification_test.csv"), header = TRUE, dec = ".", sep = ";")
testId <- read.csv(file = file.path("Absenteeism_at_work_classification_test.csv"), header = TRUE, dec = ".", sep = ";") %>% select_at("ID")
set.seed(123)
1.2 Our functions
We have created a series of functions that compute the models we will use in this assignment, as well as a function that compares results among different well-known models to determine the most suitable one. Due to the size and composition of the dataset and the kind of analysis we need to do, we will test the following models: Linear Model, Lasso, Ridge, Decision Tree, XGBoost, Random Forest.
lm.model <- function(training_dataset) {
# Create a training control configuration that applies a 5-fold cross validation
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all",
verboseIter = F,
sampling = "up")
# Fit a glm model to the input training data
set.seed(123)
this.model <- train(Absenteeism ~ .,
data = training_dataset, #the dataset is what will change
method = "glm",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl=train_control_config)
return(this.model)
}
Glmnet.model <- function(training_dataset) {
trainControlLasso <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all",
verboseIter = F,
sampling = "up")
gridGlmnet<- expand.grid(alpha = seq(0, 1, by = 0.01),
lambda = seq(0.001, 0.05, by = 0.01))
set.seed(123)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "glmnet",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlLasso,
tuneGrid = gridGlmnet
)
return(this.model)
}
Tree.model <- function(training_dataset){
trainControlTree <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all",
verboseIter = F,
sampling = "up")
set.seed(123)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "rpart",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlTree,
tuneLength = 200 # for random search
)
return(this.model)
}
XGB.model <- function(training_dataset){
trainControlXGB <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all",
verboseIter = F,
allowParallel = TRUE,
sampling = "up")
registerDoMC(cores=4)
set.seed(123)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "xgbTree",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlXGB,
tuneLength = 20, # for random search
num.threads = 4
)
return(this.model)
}
RandomForest.model <- function(training_dataset){
trainControlRandomForest <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all",
verboseIter = F,
allowParallel = T,
sampling = "up")
set.seed(123)
registerDoMC(cores=4)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "rf",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlRandomForest,
tuneLength = 50, # for random search
num.threads = 4
)
return(this.model)
}
Model.compare <- function(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "None"){
## This function compares the accuracy of lm.model, Glmnet.model, Tree.model and RandomForest.model
accuracies <- c(max(LmModel$results$Accuracy), max(GlmnetModel$results$Accuracy),
max(TreeModel$results$Accuracy), max(RandomForestModel$results$Accuracy))
models <- c("Linear", "Glmnet", "Tree", "RandomForest")
treatment <- rep(TreatmentDescription, 4)
df <- data.frame(Model = models,
Accuracy = accuracies,
Treatment = treatment)
return(df)
}
This is done for a better understanding of the data and to improve the visualization in the graphs.
Because we will do some recoding, we first combine both datasets so that we only run the process once and avoid differences in features between the training and test datasets.
test$Absenteeism <- 0 # Test data has no target variable.
dataset <- rbind(training, test)
dataset <- dataset[,!(names(dataset) %in% c("ID", "ID.Worker"))] # Removing ID variables.
# We will recode variable names so that they fit better in plots
names(dataset) <- recode(names(dataset),
"Transportation.expense" = "TranspExp",
"Hit.target" ="Hit",
"Social.smoker" = "Smoker",
"Distance.from.Residence.to.Work" = "WorkDist",
"Disciplinary.failure" = "DiscFailure",
"Month.of.absence" = "MonthOfAbs",
"ID.Worker" = "WorkerID",
"Service.time" = "ServTime",
"Day.of.the.week" = "WeekDay",
"Work.load.Average.day" = "WorkLoadAvgDay",
"Social.drinker" = "Drinker",
"Body.mass.index" = "BodyMassIndex",
"Reason.for.absence" = "Reason")
dataset$Reason <- recode(dataset$Reason,
'0'='InfParasDis', '1'='Neoplasms', '2'='DisOfBblood', '3'='Endoc&metDis', '4'='MentAndBehavDisor',
'5'='NervSys', '6'='EyeAndBdnexa', '7'='EarAndMast', '8'='CircSys', '9'='RespSys', '10'='DigSys',
'11'='SkinAndSaubcutTiss', '12'='MuscuAndConnect', '13'='Genitourinary', '14'='PregnanBirthAndPuerp',
'15'='Perinatal', '16'='Congenital', '17'='AbnormalFindings', '18'='InjuryPoisoningAndOther',
'19'='MorbidityAndMortality', '21'='FactorsinfluencingHealth', '22'='patientFollowUp',
'23'='Consultation', '24'='BloodDonation', '25'='LabExamination', '26'='Unjustified',
'27'='Physiotherapy', '28'='Dental')
dataset$Seasons <- recode(dataset$Seasons,'1'='summer','2'='autumn','3'='winter','4'='spring')
dataset$Education <- recode(dataset$Education, '1'='highschool','2'='graduate','3'='postgraduate','4'='masterAndPhD')
dataset$WeekDay <- recode(dataset$WeekDay, '2'='Mon','3'='Tue','4'='Wed','5'='Thr', '6'='Fri')
summary(dataset)
## Reason MonthOfAbs WeekDay Seasons
## Length:740 Min. : 0.000 Length:740 Length:740
## Class :character 1st Qu.: 3.000 Class :character Class :character
## Mode :character Median : 6.000 Mode :character Mode :character
## Mean : 6.324
## 3rd Qu.: 9.000
## Max. :12.000
## TranspExp WorkDist ServTime Age
## Min. :118.0 Min. : 5.00 Min. : 1.00 Min. :27.00
## 1st Qu.:179.0 1st Qu.:16.00 1st Qu.: 9.00 1st Qu.:31.00
## Median :225.0 Median :26.00 Median :13.00 Median :37.00
## Mean :221.3 Mean :29.63 Mean :12.55 Mean :36.45
## 3rd Qu.:260.0 3rd Qu.:50.00 3rd Qu.:16.00 3rd Qu.:40.00
## Max. :388.0 Max. :52.00 Max. :29.00 Max. :58.00
## WorkLoadAvgDay Hit DiscFailure Education
## Min. :205.9 Min. : 81.00 Min. :0.00000 Length:740
## 1st Qu.:244.4 1st Qu.: 93.00 1st Qu.:0.00000 Class :character
## Median :264.2 Median : 95.00 Median :0.00000 Mode :character
## Mean :271.5 Mean : 94.59 Mean :0.05405
## 3rd Qu.:294.2 3rd Qu.: 97.00 3rd Qu.:0.00000
## Max. :378.9 Max. :100.00 Max. :1.00000
## Son Drinker Smoker Pet
## Min. :0.000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.000 Median :1.0000 Median :0.00000 Median :0.0000
## Mean :1.019 Mean :0.5676 Mean :0.07297 Mean :0.7459
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :4.000 Max. :1.0000 Max. :1.00000 Max. :8.0000
## Weight Height BodyMassIndex Absenteeism
## Min. : 56.00 Min. :163.0 Min. :19.00 Min. :0.0000
## 1st Qu.: 69.00 1st Qu.:169.0 1st Qu.:24.00 1st Qu.:0.0000
## Median : 83.00 Median :170.0 Median :25.00 Median :0.0000
## Mean : 79.04 Mean :172.1 Mean :26.68 Mean :0.2946
## 3rd Qu.: 89.00 3rd Qu.:172.0 3rd Qu.:31.00 3rd Qu.:1.0000
## Max. :108.00 Max. :196.0 Max. :38.00 Max. :1.0000
Check variable types (factor or numeric).
We see that there are some columns that need to be converted to factors. Booleans (0 or 1) are set as factors too.
# factor will convert the column to categorical. lapply applies the factor function to each of the columns in categorical_columns
categorical_columns <- c('Reason', 'MonthOfAbs', 'WeekDay', 'Seasons', 'DiscFailure', 'Drinker', 'Smoker', 'Absenteeism', 'Education')
dataset[categorical_columns] <- lapply(dataset[categorical_columns], factor)
summary(dataset)
## Reason MonthOfAbs WeekDay Seasons
## Consultation :149 3 : 87 Fri:144 autumn:192
## Dental :112 2 : 72 Mon:161 spring:195
## Physiotherapy : 69 10 : 71 Thr:125 summer:170
## Genitourinary : 55 7 : 67 Tue:154 winter:183
## InfParasDis : 43 5 : 64 Wed:156
## MorbidityAndMortality: 40 11 : 63
## (Other) :272 (Other):316
## TranspExp WorkDist ServTime Age
## Min. :118.0 Min. : 5.00 Min. : 1.00 Min. :27.00
## 1st Qu.:179.0 1st Qu.:16.00 1st Qu.: 9.00 1st Qu.:31.00
## Median :225.0 Median :26.00 Median :13.00 Median :37.00
## Mean :221.3 Mean :29.63 Mean :12.55 Mean :36.45
## 3rd Qu.:260.0 3rd Qu.:50.00 3rd Qu.:16.00 3rd Qu.:40.00
## Max. :388.0 Max. :52.00 Max. :29.00 Max. :58.00
##
## WorkLoadAvgDay Hit DiscFailure Education
## Min. :205.9 Min. : 81.00 0:700 graduate : 46
## 1st Qu.:244.4 1st Qu.: 93.00 1: 40 highschool :611
## Median :264.2 Median : 95.00 masterAndPhD: 4
## Mean :271.5 Mean : 94.59 postgraduate: 79
## 3rd Qu.:294.2 3rd Qu.: 97.00
## Max. :378.9 Max. :100.00
##
## Son Drinker Smoker Pet Weight
## Min. :0.000 0:320 0:686 Min. :0.0000 Min. : 56.00
## 1st Qu.:0.000 1:420 1: 54 1st Qu.:0.0000 1st Qu.: 69.00
## Median :1.000 Median :0.0000 Median : 83.00
## Mean :1.019 Mean :0.7459 Mean : 79.04
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 89.00
## Max. :4.000 Max. :8.0000 Max. :108.00
##
## Height BodyMassIndex Absenteeism
## Min. :163.0 Min. :19.00 0:522
## 1st Qu.:169.0 1st Qu.:24.00 1:218
## Median :170.0 Median :25.00
## Mean :172.1 Mean :26.68
## 3rd Qu.:172.0 3rd Qu.:31.00
## Max. :196.0 Max. :38.00
##
str(dataset)
## 'data.frame': 740 obs. of 20 variables:
## $ Reason : Factor w/ 28 levels "AbnormalFindings",..: 28 14 5 9 5 22 18 22 20 20 ...
## $ MonthOfAbs : Factor w/ 13 levels "0","1","2","3",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ WeekDay : Factor w/ 5 levels "Fri","Mon","Thr",..: 4 4 5 3 3 1 2 2 2 4 ...
## $ Seasons : Factor w/ 4 levels "autumn","spring",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ TranspExp : int 289 118 179 279 289 361 155 235 260 260 ...
## $ WorkDist : int 36 13 51 5 36 52 12 11 50 50 ...
## $ ServTime : int 13 18 18 14 13 3 14 14 11 11 ...
## $ Age : int 33 50 38 39 33 28 34 37 36 36 ...
## $ WorkLoadAvgDay: num 240 240 240 240 240 ...
## $ Hit : int 97 97 97 97 97 97 97 97 97 97 ...
## $ DiscFailure : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ Education : Factor w/ 4 levels "graduate","highschool",..: 2 2 2 2 2 2 2 4 2 2 ...
## $ Son : int 2 1 0 2 2 1 2 1 4 4 ...
## $ Drinker : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
## $ Smoker : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ Pet : int 1 0 0 0 1 4 0 1 0 0 ...
## $ Weight : int 90 98 89 68 90 80 95 88 65 65 ...
## $ Height : int 172 178 170 168 172 172 196 172 168 168 ...
## $ BodyMassIndex : int 30 31 31 24 30 27 25 29 23 23 ...
## $ Absenteeism : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 2 2 ...
Let's check for duplicates in the training data and, if there are any, remove them to avoid overfitting.
training <- dataset[1:593,]
test <- dataset[594:740,]
print("Number of rows duplicated :")
## [1] "Number of rows duplicated :"
nrow(training) - nrow(distinct(training))
## [1] 35
training <- distinct(training)
From the previous analysis we infer that the data is complete. Just to double-check that there are no missing values or NAs:
colSums(is.na(dataset))
## Reason MonthOfAbs WeekDay Seasons TranspExp
## 0 0 0 0 0
## WorkDist ServTime Age WorkLoadAvgDay Hit
## 0 0 0 0 0
## DiscFailure Education Son Drinker Smoker
## 0 0 0 0 0
## Pet Weight Height BodyMassIndex Absenteeism
## 0 0 0 0 0
We will compute our models without any feature engineering and compare them later against every modification we make to our data.
training <- dataset[1:593,]
test <- dataset[594:740,]
# Creating an empty df that will contain the comparisons of our models
comparisonDF <- data.frame(Model = character(), Accuracy = numeric(), Treatment = character())
# Fitting our models
LmModel <- lm.model(training)
GlmnetModel <- Glmnet.model(training)
TreeModel <- Tree.model(training)
RandomForestModel <- RandomForest.model(training)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "None"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
Glmnet is the model with the highest Accuracy: 0.853 with minimal feature engineering.
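Since caret stores the winning hyperparameters in the fitted object, we can inspect which alpha/lambda pair the grid search selected (a quick sketch; the exact values depend on the run):

```r
GlmnetModel$bestTune               # selected (alpha, lambda); alpha = 0 is Ridge, alpha = 1 is Lasso
max(GlmnetModel$results$Accuracy)  # the cross-validated Accuracy reported above
```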
Before any feature engineering, we will visualize our data to see if we can observe any interesting patterns.
Let's plot a histogram of all continuous variables to see their distributions.
training %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let's now create a "violin" plot of the continuous variables to compare the behaviour of the positive and negative populations in our training set. If a feature has a very similar distribution in both, we might consider removing it from the analysis.
# Violin plots
DataForViolinPLot <- training %>%
keep(is.numeric) %>%
cbind(training$Absenteeism) %>%
melt()
## Using training$Absenteeism as id variables
names(DataForViolinPLot)[1] <- "Absenteeism"
ggplot(DataForViolinPLot, aes(x = Absenteeism, y = value)) +
facet_wrap(~ variable, scales = "free") +
geom_violin()
In this plot we can see the distribution of the continuous variables for each target class. We would expect good explanatory variables to have different shapes, i.e. a different distribution for each class. We can see that variables such as Weight, for example, have nearly the same shape, so they are not very promising. Nevertheless, we will compute some statistical analysis later on to back up this visual analysis.
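One such statistical check (a sketch; it is informal and not part of the modelling pipeline) is a Wilcoxon rank-sum test per continuous variable, comparing its distribution between the two Absenteeism classes; a large p-value supports the visual impression that the distributions are similar:

```r
# Wilcoxon rank-sum test of each numeric feature against the binary target
num_vars <- names(dplyr::select_if(training, is.numeric))
pvals <- sapply(num_vars, function(v) {
  wilcox.test(training[[v]] ~ training$Absenteeism)$p.value
})
sort(round(pvals, 4))  # variables at the high end look less promising as predictors
```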
Now we will check the correlation among the continuous variables.
# Correlation among numeric features
correlation <- training %>% keep(is.numeric) %>% cor(method = "pearson")
corrplot(correlation, type = 'full', method = 'circle')
We can see that Weight and BodyMassIndex are correlated, which makes sense as BMI = kg/m^2. Another correlation we see is between Son and Pet and the travel distance: this also makes sense, as bigger houses are in the suburbs.
# Correlation between bodymass index and weight
print("Bodymass Index - weight pearson correlation: ")
## [1] "Bodymass Index - weight pearson correlation: "
correlation["BodyMassIndex", "Weight"]
## [1] 0.9092747
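Given this near-0.91 correlation, one option (a sketch using caret::findCorrelation; we did not apply it in our pipeline) would be to flag and drop one variable of each highly correlated pair:

```r
# Flag columns whose pairwise Pearson correlation exceeds 0.9
high_corr <- caret::findCorrelation(correlation, cutoff = 0.9, names = TRUE)
high_corr  # expected to flag Weight or BodyMassIndex
# training_reduced <- training[, !(names(training) %in% high_corr)]
```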
Let's boxplot all the continuous variables.
# Box plot of numeric variables
numVars<- dplyr::select_if(training, is.numeric)
for (i in unique(names(numVars))){
dataForPlot <- numVars %>% select_at(i)
plot <- ggplot(dataForPlot, aes_string(y = i)) + stat_boxplot()
show(plot)
}
We can see that there are some outliers in our data, but nothing very extreme. This issue will be addressed later.
First we count the number of unique values (categories) per feature
sapply(dataset %>% keep(is.factor), function(x){length(unique(x))})
## Reason MonthOfAbs WeekDay Seasons DiscFailure Education
## 28 13 5 4 2 4
## Drinker Smoker Absenteeism
## 2 2 2
Observations:
- 13 months: this will need to be addressed later. See whether it is necessary to transform the variable and evaluate the impact of leaving the 3 observations with month 0.
- Too many reasons for absence.
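The month-0 observations flagged above can be inspected directly before deciding how to handle them (a quick sketch):

```r
# Distribution of observations per month, and the rows with the invalid month 0
table(training$MonthOfAbs)
subset(training, MonthOfAbs == "0")
```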
In the following graphs we will compare the behaviour of the categorical variables between negative (0) and positive (1) values.
# plot of categorical variables
FactorVars<- dplyr::select_if(training, is.factor)
for (i in unique(names(FactorVars))[names(FactorVars) != "Absenteeism"]){
dataForPlot <- FactorVars %>% select_at(c(i, "Absenteeism"))
plot <- ggplot(dataForPlot, aes_string(i, "Absenteeism")) + geom_count() + coord_flip()
show(plot)
}
The first thing we see is that categories like "masterAndPhD" have very low frequency; maybe it makes sense to group them and keep only two classes, highschool & university.
It also seems that there are too many reasons for absenteeism, some of them with very low frequency. We could create a group called "Others" to gather all low-frequency classes (less than 1%).
There are some interesting patterns, at least in the Drinker, DiscFailure, Seasons and WeekDay variables. In the following barplots we can observe the behaviour:
- Although the number of observations per season is equally distributed, we observe a significantly higher amount of positive values during winter.
- Although the number of observations per day of the week is similarly distributed, we observe a significantly higher amount of positive values on Mondays and lower on Thursdays.
- There is a very low number of employees with disciplinary failures in the training set, and none of the employees with disciplinary failures has an absenteeism of more than 5 hours.
- Unlike smokers, 60% of the employees who drink have extended leaves.
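Proportions like the drinker figure above can be checked directly (a sketch computed on the training set):

```r
# Share of positive (> 5 h) absences within drinkers (1) vs non-drinkers (0)
prop.table(table(Drinker = training$Drinker, Absenteeism = training$Absenteeism),
           margin = 1)
```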
true_values <- training[training$Absenteeism == 1,]
ggplot(data = true_values, aes(x=Seasons)) + geom_bar()
ggplot(data = true_values, aes(x=WeekDay)) + geom_bar() + ggtitle("Monday, bloody Monday....")
ggplot(data = true_values, aes(x=Drinker)) + geom_bar()
ggplot(data = true_values, aes(x=DiscFailure)) + geom_bar()
Finally, let's focus on the frequencies of Reason for absenteeism and Education.
reasonFreq <- freq(training$Reason, digits = 5, valid = F) %>% tibble::rownames_to_column("Reason")
EducatioFreq <- freq(training$Education, digits = 5, valid = F) %>% tibble::rownames_to_column("Education")
reasonFreq
## Reason n %
## 1 AbnormalFindings 1 0.16863
## 2 BloodDonation 3 0.50590
## 3 CircSys 6 1.01180
## 4 Congenital 3 0.50590
## 5 Consultation 110 18.54975
## 6 Dental 96 16.18887
## 7 DigSys 23 3.87858
## 8 DisOfBblood 1 0.16863
## 9 EarAndMast 13 2.19224
## 10 Endoc&metDis 1 0.16863
## 11 EyeAndBdnexa 8 1.34907
## 12 FactorsinfluencingHealth 5 0.84317
## 13 Genitourinary 47 7.92580
## 14 InfParasDis 32 5.39629
## 15 InjuryPoisoningAndOther 17 2.86678
## 16 LabExamination 28 4.72175
## 17 MentAndBehavDisor 1 0.16863
## 18 MorbidityAndMortality 28 4.72175
## 19 MuscuAndConnect 8 1.34907
## 20 Neoplasms 14 2.36088
## 21 NervSys 3 0.50590
## 22 patientFollowUp 26 4.38449
## 23 Perinatal 1 0.16863
## 24 Physiotherapy 54 9.10624
## 25 PregnanBirthAndPuerp 15 2.52951
## 26 RespSys 4 0.67454
## 27 SkinAndSaubcutTiss 21 3.54132
## 28 Unjustified 24 4.04722
EducatioFreq
## Education n %
## 1 graduate 36 6.07083
## 2 highschool 489 82.46206
## 3 masterAndPhD 3 0.50590
## 4 postgraduate 65 10.96121
ggplot(reasonFreq , aes(x = reorder(Reason, - `%`), y = `%`)) +
geom_col() +
ylab("%") +
xlab("Reason") +
coord_flip()
ggplot(EducatioFreq , aes(x = reorder(Education, - `%`), y = `%`)) +
geom_col() +
ylab("%") +
xlab("Education") +
coord_flip()
As we saw in the boxplots of the continuous variables, there are some outliers that might be affecting our results. Let's remove them:
for (col in names(training)) {
if (is.numeric(training[[col]]) && col != "Absenteeism"){
to_remove <- boxplot.stats(training[[col]], coef = 4)$out
training_no_outliers <- training[!training[[col]] %in% to_remove, ]
print(col)
print(to_remove)
print(ggplot(training_no_outliers, aes_string(y=col))+ geom_boxplot(width=0.1))
}
}
## [1] "TranspExp"
## integer(0)
## [1] "WorkDist"
## integer(0)
## [1] "ServTime"
## integer(0)
## [1] "Age"
## integer(0)
## [1] "WorkLoadAvgDay"
## numeric(0)
## [1] "Hit"
## integer(0)
## [1] "Son"
## integer(0)
## [1] "Pet"
## [1] 8 8 8 8
## [1] "Weight"
## integer(0)
## [1] "Height"
## [1] 196 185 196 196 196 196 196 185 196 196 196 196 196 196 196 185 196
## [18] 185 196 185 196 196 196 196 196 196 196
## [1] "BodyMassIndex"
## integer(0)
As we can see, for a coefficient of 4 only number of pets = 8 and height = 196, 185 are considered outliers. In our opinion these values are not very extreme, especially height = 196, 185. Nevertheless, we will test our models without the outliers.
# Fitting our models without outliers
LmModel <- lm.model(training_no_outliers)
GlmnetModel <- Glmnet.model(training_no_outliers)
TreeModel <- Tree.model(training_no_outliers)
RandomForestModel <- RandomForest.model(training_no_outliers)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "RemovingOutliersCoef4"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
rm(training_no_outliers)
It seems that removing them has no influence on accuracy. Considering that our training set has fewer than 600 observations and that there are no significant outliers that could be attributed to a mistake while loading the information, we will not consider applying coefficients lower than 4.
Given the lack of improvement after removing the outliers, we will keep them for the feature engineering stage and continue monitoring them.
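To see how sensitive the outlier count is to the whisker coefficient (a sketch; coef = 1.5 is the classic Tukey fence, while coef = 4 is the conservative value used above):

```r
# Number of Height values flagged as outliers at increasing whisker coefficients
sapply(c(1.5, 2, 3, 4),
       function(k) length(boxplot.stats(training$Height, coef = k)$out))
```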
train_buckson <- training
train_buckson$Son <-as.factor(ifelse(train_buckson$Son == 0, "No", "Yes"))
LmModel <- lm.model(train_buckson)
GlmnetModel <- Glmnet.model(train_buckson)
TreeModel <- Tree.model(train_buckson)
RandomForestModel <- RandomForest.model(train_buckson)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Sons"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
We observe a slight improvement in Glmnet performance, so we will keep the change.
training$Son <-as.factor(ifelse(training$Son == 0, "No", "Yes"))
Let's now try the same with Pets.
train_buckpet <- training
train_buckpet$Pet <-as.factor(ifelse(train_buckpet$Pet == 0, "No", "Yes"))
LmModel <- lm.model(train_buckpet)
GlmnetModel <- Glmnet.model(train_buckpet)
TreeModel <- Tree.model(train_buckpet)
RandomForestModel <- RandomForest.model(train_buckpet)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Pets"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
rm(train_buckpet)
No improvement here, so we discard the change.
# Let's group reasons with a frequency of 1% or less
reasonsToGroup <- reasonFreq[reasonFreq$`%` <= 1, "Reason"]
training_buckreasson <- training
training_buckreasson$Reason <- as.character(training_buckreasson$Reason) # We have to convert to character and then back
training_buckreasson[training_buckreasson$Reason %in% reasonsToGroup, "Reason"] <- "OTHERS"
training_buckreasson$Reason <- as.factor(training_buckreasson$Reason)
freq(training_buckreasson$Reason, digits = 5, valid = F)
## n %
## CircSys 6 1.01180
## Consultation 110 18.54975
## Dental 96 16.18887
## DigSys 23 3.87858
## EarAndMast 13 2.19224
## EyeAndBdnexa 8 1.34907
## Genitourinary 47 7.92580
## InfParasDis 32 5.39629
## InjuryPoisoningAndOther 17 2.86678
## LabExamination 28 4.72175
## MorbidityAndMortality 28 4.72175
## MuscuAndConnect 8 1.34907
## Neoplasms 14 2.36088
## OTHERS 23 3.87858
## patientFollowUp 26 4.38449
## Physiotherapy 54 9.10624
## PregnanBirthAndPuerp 15 2.52951
## SkinAndSaubcutTiss 21 3.54132
## Unjustified 24 4.04722
ReasonFrec <- freq(training_buckreasson$Reason, digits = 5, valid = F)
ReasonFrec <- tibble::rownames_to_column(ReasonFrec, "Reason") # freq() leaves the reasons as rownames, so we turn them into a column for ggplot
ReasonFrec
## Reason n %
## 1 CircSys 6 1.01180
## 2 Consultation 110 18.54975
## 3 Dental 96 16.18887
## 4 DigSys 23 3.87858
## 5 EarAndMast 13 2.19224
## 6 EyeAndBdnexa 8 1.34907
## 7 Genitourinary 47 7.92580
## 8 InfParasDis 32 5.39629
## 9 InjuryPoisoningAndOther 17 2.86678
## 10 LabExamination 28 4.72175
## 11 MorbidityAndMortality 28 4.72175
## 12 MuscuAndConnect 8 1.34907
## 13 Neoplasms 14 2.36088
## 14 OTHERS 23 3.87858
## 15 patientFollowUp 26 4.38449
## 16 Physiotherapy 54 9.10624
## 17 PregnanBirthAndPuerp 15 2.52951
## 18 SkinAndSaubcutTiss 21 3.54132
## 19 Unjustified 24 4.04722
ggplot(ReasonFrec , aes(x = reorder(Reason, - `%`), y = `%`)) +
geom_col() +
ylab("%") +
xlab("Reason") +
coord_flip() +
ggtitle("Reasons with 'OTHERS' group")
LmModel <- lm.model(training_buckreasson)
GlmnetModel <- Glmnet.model(training_buckreasson)
TreeModel <- Tree.model(training_buckreasson)
RandomForestModel <- RandomForest.model(training_buckreasson)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Reassons"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
rm(training_buckreasson)
We see an improvement in the Random Forest model, but our best model, Glmnet, has decreased accuracy. We decide not to make the change.
training_groupEd <- training
training_groupEd$Higher_Education <- as.factor(recode(training_groupEd$Education, 'graduate'='Yes','postgraduate'='Yes','masterAndPhD'='Yes','highschool'='No'))
training_groupEd[1:594,] %>%
ggplot(aes(x = Higher_Education)) + geom_bar()
# Fitting our models after grouping education
LmModel <- lm.model(training_groupEd)
GlmnetModel <- Glmnet.model(training_groupEd)
TreeModel <- Tree.model(training_groupEd)
RandomForestModel <- RandomForest.model(training_groupEd)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Education"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
The tendency is that all models improve, so we keep the change and move on!
training <- training_groupEd
rm(training_groupEd)
# day next to weekend (Monday or Friday); WeekDay was already recoded to day names above
training$MonOrFri <- as.factor(ifelse(training$WeekDay %in% c("Mon", "Fri"), "Yes", "No"))
# avg. transportation cost per unit of distance
training$Transp_avg_cost <- as.numeric(training$TranspExp/training$WorkDist)
# Log of distance to work
training$log_dist <-log(training$WorkDist)
# Fitting our models after new features
LmModel <- lm.model(training)
GlmnetModel <- Glmnet.model(training)
TreeModel <- Tree.model(training)
RandomForestModel <- RandomForest.model(training)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "New features"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25 Linear 0.8195983 New features
## 26 Glmnet 0.8515596 New features
## 27 Tree 0.8364193 New features
## 28 RandomForest 0.8245834 New features
Not really improving! What is next?
# Removing the new features, since they did not improve the models
training$MonOrFri <- NULL
training$Transp_avg_cost <- NULL
training$log_dist <- NULL
training_morebuck <- training
# Bucketize BMI into categories; note that values outside [18.5, 39.99] become NA with these breaks
training_morebuck$BodyMassIndex <- .bincode(training_morebuck$BodyMassIndex, c(18.5, 24.9, 29.99, 34.9, 39.99), TRUE, TRUE)
training_morebuck$BodyMassIndex <- as.factor(recode(training_morebuck$BodyMassIndex, '1' = "Normal", '2' = "Overweight", '3' = "Obesity", '4' = "ExtremeObesity"))
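As a sanity check on how `.bincode` behaves at the edges (a minimal sketch with made-up values), note that observations falling outside the outermost breaks are coded as NA:

```r
# .bincode(x, breaks, right = TRUE, include.lowest = TRUE) assigns each value
# to the interval (breaks[i], breaks[i+1]]; values outside the range give NA.
x <- c(22, 27, 33, 41, 17)
bins <- .bincode(x, c(18.5, 24.9, 29.99, 34.9, 39.99), TRUE, TRUE)
bins  # 1 2 3 NA NA -- 41 and 17 fall outside the breaks
```

Depending on the range of BMI values in the HR data, it may be worth widening the outer breaks (e.g., starting at 0) so no observation is silently turned into NA.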
# Age, bucketized into career-stage groups (ages outside [18, 60] become NA with these breaks)
training_morebuck$Age <- as.factor(.bincode(training_morebuck$Age, c(18, 30, 45, 60), TRUE, TRUE))
# Fitting our models after new features
LmModel <- lm.model(training_morebuck)
GlmnetModel <- Glmnet.model(training_morebuck)
TreeModel <- Tree.model(training_morebuck)
RandomForestModel <- RandomForest.model(training_morebuck)
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "More bucketising"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25 Linear 0.8195983 New features
## 26 Glmnet 0.8515596 New features
## 27 Tree 0.8364193 New features
## 28 RandomForest 0.8245834 New features
## 29 Linear 0.8212648 More bucketising
## 30 Glmnet 0.8532830 More bucketising
## 31 Tree 0.8364193 More bucketising
## 32 RandomForest 0.8314200 More bucketising
Also not improving on our best model.
Remove the features that are not significant, based on experience in this field and the observations from section 3: - Month: we keep Seasons instead - Disc.Failure: too unbalanced - Education: replaced by the high-education flag - Weight and Height: information already captured by BMI
light_train <- training[,!(names(training) %in% c("MonthOfAbs", "DiscFailure", "Education", "Weight", "Height"))]
LmModel <- lm.model(light_train)
GlmnetModel <- Glmnet.model(light_train)
TreeModel <- Tree.model(light_train)
RandomForestModel <- RandomForest.model(light_train)
## note: only 45 unique complexity parameters in default grid. Truncating the grid to 45 .
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "previous FE + removing features, no FSel"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25 Linear 0.8195983 New features
## 26 Glmnet 0.8515596 New features
## 27 Tree 0.8364193 New features
## 28 RandomForest 0.8245834 New features
## 29 Linear 0.8212648 More bucketising
## 30 Glmnet 0.8532830 More bucketising
## 31 Tree 0.8364193 More bucketising
## 32 RandomForest 0.8314200 More bucketising
## 33 Linear 0.8448084 previous FE + removing features, no FSel
## 34 Glmnet 0.8566016 previous FE + removing features, no FSel
## 35 Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel
Well, it seems that removing these variables actually gains us some accuracy. Let's keep the change then.
training <- light_train
For this feature-selection analysis we considered all the variables.
training_raw <- dataset[1:593,]
chisquared <- data.frame(chi.squared(Absenteeism~., training_raw[names(training_raw)]))
chisquared$features <- rownames(chisquared)
# Plot the result, and remove those below the 1st quartile (an aggressive threshold)
par(mfrow=c(1,2))
boxplot(chisquared$attr_importance)
bp.stats <- boxplot.stats(chisquared$attr_importance)$stats # Get the statistics from the boxplot
chisquared.threshold = bp.stats[2] # This element represent the 1st quartile (more on: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/boxplot.stats.html).
text(y = bp.stats, labels = bp.stats, x = 1.3, cex=0.7)
barplot(sort(chisquared$attr_importance), names.arg = chisquared[order(chisquared$attr_importance),]$features, cex.names = 0.6, las=2, horiz = T)
abline(v=chisquared.threshold, col='red') # Draw a red line at the 1st quartile
dev.off()
## null device
## 1
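A quick illustration of what `boxplot.stats(...)$stats` returns (a sketch with toy numbers); the second element is the lower hinge, which is what we use as the chi-squared threshold:

```r
# $stats holds: lower whisker, lower hinge (~1st quartile), median,
# upper hinge (~3rd quartile), upper whisker
s <- boxplot.stats(1:9)$stats
s     # 1 3 5 7 9
s[2]  # 3 -- the lower hinge used as our cut-off
```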
# Moving the red line (i.e., the threshold) lets us test whether a stricter or looser selection improves the results
# Determine what features to remove from the dataset.
features_to_remove <- as.character(chisquared[chisquared$attr_importance <= chisquared.threshold, "features"])
chi_squared_model = lm.model(training_raw[!names(training_raw) %in% features_to_remove])
confusionMatrix(chi_squared_model, "none")
## Cross-Validated (5 fold, repeated 1 times) Confusion Matrix
##
## (entries are un-normalized aggregated counts)
##
## Reference
## Prediction 0 1
## 0 311 32
## 1 64 186
##
## Accuracy (average) : 0.8381
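The averaged accuracy can be recomputed directly from the aggregated counts above:

```r
# Confusion matrix counts (rows: predicted 0/1, columns: reference 0/1)
cm <- matrix(c(311, 64, 32, 186), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)  # 0.8381 -- matches the reported average
```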
LmModel <- lm.model(training_raw[!names(training_raw) %in% features_to_remove])
GlmnetModel <- Glmnet.model(training_raw[!names(training_raw) %in% features_to_remove])
TreeModel <- Tree.model(training_raw[!names(training_raw) %in% features_to_remove])
RandomForestModel <- RandomForest.model(training_raw[!names(training_raw) %in% features_to_remove])
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "with chisquared FS"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25 Linear 0.8195983 New features
## 26 Glmnet 0.8515596 New features
## 27 Tree 0.8364193 New features
## 28 RandomForest 0.8245834 New features
## 29 Linear 0.8212648 More bucketising
## 30 Glmnet 0.8532830 More bucketising
## 31 Tree 0.8364193 More bucketising
## 32 RandomForest 0.8314200 More bucketising
## 33 Linear 0.8448084 previous FE + removing features, no FSel
## 34 Glmnet 0.8566016 previous FE + removing features, no FSel
## 35 Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel
## 37 Linear 0.8380715 with chisquared FS
## 38 Glmnet 0.8582823 with chisquared FS
## 39 Tree 0.8364193 with chisquared FS
## 40 RandomForest 0.8381427 with chisquared FS
Good results fitting the model with the variables selected by the Chi-squared test. We tried different options moving the threshold, but the results were not significantly different, so we set it back to zero.
weights<- data.frame(information.gain(Absenteeism~., training_raw))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
## attr_importance feature
## Reason 0.309119821 Reason
## WeekDay 0.024079811 WeekDay
## DiscFailure 0.023961909 DiscFailure
## TranspExp 0.022994975 TranspExp
## Son 0.019335481 Son
## MonthOfAbs 0.017304305 MonthOfAbs
## Height 0.016319739 Height
## Seasons 0.004755832 Seasons
## Smoker 0.003853220 Smoker
## Education 0.002718730 Education
## Drinker 0.002333402 Drinker
## WorkDist 0.000000000 WorkDist
## ServTime 0.000000000 ServTime
## Age 0.000000000 Age
## WorkLoadAvgDay 0.000000000 WorkLoadAvgDay
## Hit 0.000000000 Hit
## Pet 0.000000000 Pet
## Weight 0.000000000 Weight
## BodyMassIndex 0.000000000 BodyMassIndex
# Keep features with an information gain of at least 0.015 (no feature sits exactly on the threshold)
information_gain_features <- weights$feature[weights$attr_importance >= 0.015]
information_gain_features_to_remove <- weights$feature[weights$attr_importance < 0.015]
training_gm <- training_raw[,!names(training_raw) %in% information_gain_features_to_remove]
information_gain_model = lm.model(training_gm)
confusionMatrix(information_gain_model, 'none')
## Cross-Validated (5 fold, repeated 1 times) Confusion Matrix
##
## (entries are un-normalized aggregated counts)
##
## Reference
## Prediction 0 1
## 0 311 32
## 1 64 186
##
## Accuracy (average) : 0.8381
LmModel <- lm.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
GlmnetModel <- Glmnet.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
TreeModel <- Tree.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
RandomForestModel <- RandomForest.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
## note: only 46 unique complexity parameters in default grid. Truncating the grid to 46 .
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "with InformationGain FS"))
comparisonDF
## Model Accuracy Treatment
## 1 Linear 0.8246119 None
## 2 Glmnet 0.8532118 None
## 3 Tree 0.8364193 None
## 4 RandomForest 0.8297109 None
## 5 Linear 0.8246119 RemovingOutliersCoef4
## 6 Glmnet 0.8532118 RemovingOutliersCoef4
## 7 Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
## 9 Linear 0.8195841 bucketizing Sons
## 10 Glmnet 0.8532260 bucketizing Sons
## 11 Tree 0.8364193 bucketizing Sons
## 12 RandomForest 0.8296966 bucketizing Sons
## 13 Linear 0.8178749 bucketizing Pets
## 14 Glmnet 0.8532260 bucketizing Pets
## 15 Tree 0.8364193 bucketizing Pets
## 16 RandomForest 0.8296824 bucketizing Pets
## 17 Linear 0.8195983 bucketizing Reassons
## 18 Glmnet 0.8481698 bucketizing Reassons
## 19 Tree 0.8364193 bucketizing Reassons
## 20 RandomForest 0.8279447 bucketizing Reassons
## 21 Linear 0.8195841 bucketizing Education
## 22 Glmnet 0.8532260 bucketizing Education
## 23 Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25 Linear 0.8195983 New features
## 26 Glmnet 0.8515596 New features
## 27 Tree 0.8364193 New features
## 28 RandomForest 0.8245834 New features
## 29 Linear 0.8212648 More bucketising
## 30 Glmnet 0.8532830 More bucketising
## 31 Tree 0.8364193 More bucketising
## 32 RandomForest 0.8314200 More bucketising
## 33 Linear 0.8448084 previous FE + removing features, no FSel
## 34 Glmnet 0.8566016 previous FE + removing features, no FSel
## 35 Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel
## 37 Linear 0.8380715 with chisquared FS
## 38 Glmnet 0.8582823 with chisquared FS
## 39 Tree 0.8364193 with chisquared FS
## 40 RandomForest 0.8381427 with chisquared FS
## 41 Linear 0.8380573 with InformationGain FS
## 42 Glmnet 0.8549637 with InformationGain FS
## 43 Tree 0.8364193 with InformationGain FS
## 44 RandomForest 0.8329725 with InformationGain FS
Results with the features selected by the Chi-squared test are better, so we discard this Information Gain selection.
# train_control_config_4_stepwise <- trainControl(method = "none", classProbs = TRUE, verboseIter = TRUE, allowParallel = TRUE)
#
# #To make strings or factors valid names in R
# training_Wrapper <- sapply(training_raw, make.names)
#
# registerDoMC(cores=4)
# backward.lm.mod <- train(Absenteeism ~ ., data = training_Wrapper,
# method = "glmStepAIC",
# direction = "backward",
# trace = FALSE,
# metric = "Accuracy",
# trControl=train_control_config_4_stepwise,
# num.threads = 4
# )
#
#
# paste("Features Selected" ,backward.lm.mod$finalModel$formula[3])
The backward stepwise lm model took a very long time to run, so we decided not to include it in our analysis.
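As a lighter-weight alternative to caret's `glmStepAIC`, base R's `step()` performs the same backward AIC search much faster on small models. A hypothetical sketch on the built-in `mtcars` data (not our training set):

```r
# Backward stepwise selection by AIC with base R step()
full.fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
backward.fit <- step(full.fit, direction = "backward", trace = 0)
formula(backward.fit)  # the reduced model retained by AIC
```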
After testing many different feature engineering techniques, we conclude that our best predictive model is, to our great regret, the Ridge one trained on the variables kept by the Chi-squared test (threshold = 0) and with no further feature engineering.
The results of all our trials can be seen here, showing the evolution of accuracy across models and treatments.
print("Our best bet:")
## [1] "Our best bet:"
comparisonDF[comparisonDF$Accuracy == max(comparisonDF$Accuracy), ]
## Model Accuracy Treatment
## 38 Glmnet 0.8582823 with chisquared FS
write.csv(comparisonDF, file = 'comparisonDF.csv', row.names = FALSE)
# Train the model using all the data.
# This time we create a bigger grid to fine-tune our hyperparameters,
# and use 2 repetitions of cross-validation to reduce overfitting as much as we can.
finalTrain <- training_raw[!names(training_raw) %in% features_to_remove]
finalTest <- test[!names(test) %in% features_to_remove]
Glmnet.model.tunning <- function(training_dataset) {
trainControlLasso <- trainControl(method = "repeatedcv",
number = 5,
repeats = 2,
returnResamp = "all",
verboseIter = F,
sampling = "up")
gridGlmnet<- expand.grid(alpha = seq(0, 1, by = 0.01),
lambda = seq(0.001, 0.05, by = 0.01))
set.seed(123)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "glmnet",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlLasso,
tuneGrid = gridGlmnet
)
return(this.model)
}
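The tuning grid above enumerates every (alpha, lambda) pair; a quick check of its size shows why this run is expensive:

```r
grid <- expand.grid(alpha = seq(0, 1, by = 0.01),
                    lambda = seq(0.001, 0.05, by = 0.01))
nrow(grid)  # 505 candidate pairs, each refit under 5-fold CV repeated twice
```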
bestTuneModel <- Glmnet.model.tunning(finalTrain)
print(bestTuneModel$bestTune)
## alpha lambda
## 5 0 0.041
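The selected `alpha = 0` corresponds to a pure Ridge (L2) penalty in glmnet (`alpha = 1` would be the Lasso). A minimal standalone sketch with synthetic data, assuming the `glmnet` package is installed (variable names are illustrative, not from our dataset):

```r
library(glmnet)
set.seed(123)
x <- matrix(rnorm(200 * 5), ncol = 5)  # 5 synthetic predictors
y <- factor(rbinom(200, 1, 0.5))       # synthetic binary target
ridge.fit <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.041)
coef(ridge.fit)  # Ridge shrinks all coefficients toward zero but zeroes none out
```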
finalModel <- function(training_dataset) {
trainControlLasso <- trainControl(method = "none")
gridGlmnet<- expand.grid(alpha = 0,
lambda = 0.041)
this.model <- train(Absenteeism ~ ., data = training_dataset,
method = "glmnet",
metric = "Accuracy",
preProc = c("center", "scale"),
trControl = trainControlLasso,
tuneGrid = gridGlmnet
)
return(this.model)
}
final.model <- finalModel(finalTrain)
# Predict the Absenteeism class for the test observations
final.pred <- predict(final.model, finalTest, type = "raw")
predictions <- data.frame(ID = testId$ID, Absenteeism.time.in.hours= (final.pred))
colnames(predictions) <-c("ID", "Absenteeism")
write.csv(predictions, file = "predictions.csv", row.names = FALSE)
This model correctly predicts prolonged absences roughly 86% of the time (cross-validated accuracy of 0.858). The most significant attribute is the reason of absence, as per the Chi-squared analysis.